Supplementary Material: Cheap Bandits
Abstract
Proof: We prove the lemma by an explicit construction of a graph. Consider a graph G consisting of d disjoint connected subgraphs denoted G_j, j = 1, 2, ..., d. Let the nodes in each subgraph have the same reward. The eigenvalues of the graph are {0, λ̂_1, ..., λ̂_{N−d}}, where the eigenvalue 0 is repeated d times; note that the set of eigenvalues of the graph is the union of the sets of eigenvalues of the individual subgraphs. Without loss of generality, assume that λ̂_1 > T/(d log(T/λ + 1)). This is always possible, for example if the subgraphs are cliques, which is what we assume. Then the effective dimension of the graph G is d.

Since the graph separates into d disjoint subgraphs, we can split the reward function f_α = Qα into d parts, one corresponding to each subgraph. We write f_j = Q_j α_j for j = 1, 2, ..., d, where f_j is the reward function associated with G_j, Q_j is the orthonormal matrix of eigenvectors of the Laplacian of G_j, and α_j is the sub-vector of α corresponding to the node rewards on G_j. Write α_j = Q_j′ f_j. Since f_j is a constant vector and all but one of the columns of Q_j are orthogonal to f_j, α_j has only one non-zero component. We conclude that for a reward function that is constant on each subgraph, α has only d non-zero components and thus lies in a d-dimensional space. The proof of the lemma is completed by setting Ĝ = G. Note that a graph with effective dimension d cannot have more than d disjoint connected subgraphs.

Next, we restrict our attention to the graph Ĝ and to rewards that are piecewise constant on the cliques, that is, the nodes in each clique have the same reward. Recall that the action set S_D consists of actions that probe a node or a group of neighboring nodes. Therefore, any group action only allows us to observe the average reward of a group of nodes within a clique, never across cliques, so all node and group actions that observe reward from within the same clique are indistinguishable. Hence, S_D collapses to a set of d distinct actions, one associated with each clique, and the problem reduces to selecting the clique with the highest reward. We henceforth treat each clique as an arm, where all nodes within the same clique share the same reward value.
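The key algebraic step above is that a reward vector which is constant on each of the d cliques has at most d non-zero coefficients in the Laplacian eigenbasis. The following minimal numerical sketch (not part of the proof) illustrates this; it assumes numpy is available, and the clique size and reward values are arbitrary illustrative choices rather than quantities from the paper.

```python
import numpy as np

d, clique_size = 3, 4                     # three disjoint cliques of four nodes each (illustrative)
N = d * clique_size

# Laplacian of a clique K_m is m*I - J (J = all-ones matrix); the graph is the
# disjoint union of the cliques, so its Laplacian is block diagonal.
block = clique_size * np.eye(clique_size) - np.ones((clique_size, clique_size))
L = np.kron(np.eye(d), block)

# L is symmetric, so eigh returns an orthonormal eigenbasis Q with L = Q diag(w) Q'.
w, Q = np.linalg.eigh(L)
print(np.round(w, 6))                     # eigenvalue 0 appears d times; all others equal clique_size

# A reward vector that is constant on each clique (piecewise-constant rewards).
f = np.repeat([1.0, 2.5, -0.7], clique_size)

# Coefficients in the eigenbasis: alpha = Q' f.
alpha = Q.T @ f
print(np.sum(np.abs(alpha) > 1e-9))       # at most d (= 3) non-zero components
```

If the rewards were arbitrary node values instead of constant per clique, generically all N coefficients would be non-zero, which is why the reduction to d arms relies on the piecewise-constant assumption.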
Similar resources
Supplementary Material to “Online Clustering of Bandits”
This supplementary material contains all proofs and technical details omitted from the main text, along with ancillary comments, discussion about related work, and extra experimental results. 1. Proof of Theorem 1 The following sequence of lemmas is of preliminary importance. The first one needs extra variance conditions on the process X generating the context vectors. We find it convenient to...
The Price of Differential Privacy for Online Learning (with Supplementary Material)
We design differentially private algorithms for the problem of online linear optimization in the full information and bandit settings with optimal Õ(√T) regret bounds. In the full-information setting, our results demonstrate that ε-differential privacy may be ensured for free – in particular, the regret bounds scale as O(√T) + Õ(1/ε). For bandit linear optimization, and as a special ...
Supplementary Material on Reputational Cheap Talk
In Ottaviani and Sørensen, henceforth OS, (2004b), we have formulated a model of strategic communication by an expert concerned about being perceived to be well informed. In that model, the expert observes a private signal informative about the state of the world. The amount of information about the state contained in this signal is parametrized by the expert's ability, assumed for simplicity t...
Material for: Batched Bandit Problems
Motivated by practical applications, chiefly clinical trials, we study the regret achievable for stochastic bandits under the constraint that the employed policy must split trials into a small number of batches. We propose a simple policy, and show that a very small number of batches gives close to minimax optimal regret bounds. As a byproduct, we derive optimal policies with low switching cost...
Cheap Bandits
We consider stochastic sequential learning problems where the learner can observe the average reward of several actions. Such a setting is interesting in many applications involving monitoring and surveillance, where the set of actions to observe represents some (geographical) area. The importance of this setting is that in these applications, it is actually cheaper to observe average reward...